Testing times: disentangling admixture histories in recent and complex demographies using ancient DNA

Abstract Our knowledge of human evolutionary history has been greatly advanced by paleogenomics. Since the 2020s, the study of ancient DNA has increasingly focused on reconstructing the recent past. However, the accuracy of paleogenomic methods in resolving questions of historical and archaeological importance amidst the increased demographic complexity and decreased genetic differentiation remains an open question. We evaluated the performance and behavior of two commonly used methods, qpAdm and the f3-statistic, on admixture inference under a diversity of demographic models and data conditions. We performed two complementary simulation approaches—firstly exploring a wide demographic parameter space under four simple demographic models of varying complexities and configurations using branch-length data from two chromosomes—and secondly, we analyzed a model of Eurasian history composed of 59 populations using whole-genome data modified with ancient DNA conditions such as SNP ascertainment, data missingness, and pseudohaploidization. We observe that population differentiation is the primary factor driving qpAdm performance. Notably, while complex gene flow histories influence which models are classified as plausible, they do not reduce overall performance. Under conditions reflective of the historical period, qpAdm most frequently identifies the true model as plausible among a small candidate set of closely related populations. To increase the utility for resolving fine-scaled hypotheses, we provide a heuristic for further distinguishing between candidate models that incorporates qpAdm model P-values and f3-statistics. Finally, we demonstrate a significant performance increase for qpAdm using whole-genome branch-length f2-statistics, highlighting the potential for improved demographic inference that could be achieved with future advancements in f-statistic estimations.

I don't have major complaints about the methodology or statistical design of the manuscript.That said, I do have several comments which I believe should be addressed in order to make the conclusions and lessons presented by the manuscript clearer to the aDNA readership and easier to put into the context of empirical aDNA work and into practical use.
Major comments ============= lines 90-91: "We started by simulating two chromosomes of combined length ~ 491 Mbp under four simplistic and qualitatively different admixture graphs" Why have the authors simulated only two chromosomes of 491 Mb in total when in the second stage, for the complex model, they do simulate truly genome-scale data (so computational cost isn't the issue)?
My question is motivated by this: If I simulate replicates of a 1Mb, 10Mb, 100Mb chromosome and compute (for instance) an f4 statistic to test an admixture hypothesis, I will get f4 values which will be less noisy the more sequence is simulated (and more accurate admixture inferences).I would expect the same should apply even for qpAdm, with an increase in power with larger sequence lengths, purely by reducing statistical noise?
In order to make the results more comparable to the behavior of real data (and between the two simulation studies in the manuscript), wouldn't it be more accurate to simulate replicates from a single simulation of all 22 autosomes for each topology (about 3000 Mb genomes)?It's not immediately obvious to me whether the f-statistics/qpAdm noisiness expected in real data would be otherwise comparable to these 491 Mb simulations.I find this decision even more surprising because on line 95 the authors write that they simulated truly whole-genome simulations for their complex models.So why not make the simulation setups the same across both simulation studies?--caption of Figure 1B: Why have the authors opted to focus on Fst values specifically in ancient Southwest Asia?Given the interesting results showing the relationship between qpAdm power and population differentiation throughout the manuscript, why not show Fst from the entire AADR panel?In fact, this could be even partitioned into panels showing this style of age-vs-Fst plots across time periods in different geographic regions (Europe, Southwest Asia, etc.?).This would make it much easier for a reader to put results such as those in Figure 3 into a wider context, including Europe where majority of aDNA is still coming from.
On a related note, while referring to the same Figure 1B on lines 225-227 the authors talk about "ancient southwest Eurasian groups".Then again, line 241 talks again about "Southwest Asians".This makes me think it should be one or the other in all of those instances.
Reading this made me again think about the effect of the length of the genome simulated, and whether simulating sequence spanning all autosomes (and not just 491 Mb) as mentioned in my comment above would change the proportion of significant models.If the proportion of plausible models is expected to change with genome-scale data, I'm not sure how translatable the described results are towards running qpAdm on empirical data which is, of course, genome-wide.
--lines 556-577: This paragraph describing a complex simulated model would be incomparably easier to read if it were accompanied by a population phylogenetic tree (or admixture graph) directly in the main text.After all, the model represents a second major case study in the manuscript.It would also make it immediately possible to compare this model to models in the first study shown in Figure 3 and see differences in topological complexity.
--lines 182 and 617: The authors mention using 5 Mb block size for performing the jackknife.
How does the size of genome blocks used for jackknifing affect the results of a qpAdm analysis?If admixture is recent, ancestry haplotypes (and extend of linkage between admixed SNPs in a target) will be larger, so blocks overlapping those haplotypes will be entirely formed by SNPs linked on those haplotypes.If admixture is ancient, ancestry haplotypes will be shorter, which will also show at the level of blocks (depending on the block sizes).As such, does the block size have any influence on qpAdm inferences, depending on the age of admixture being modelled?If there is such an effect, perhaps showing a simple figure demonstrating p-values and inferred admixture proportions as a function of block size would be useful.If there's not an effect and this is of no concern at all, a reference to the relevant literature would be helpful.
---Given that both submitted qpAdm manuscript were apparently worked on in tandem, it would be useful for readers to know if this manuscript follows the advice from the bigger companion paper, in particular with respect to separating (or not) target/left/right sets based on sampling time, etc (i.e.advice that's stated in the "Best practices" list of Yüncü el.).Stating the relationship of the simulation setup of this manuscript to the best practices recommended by the companion study would help to put results and lessons in this paper into the context of the other study. --- The authors recommend f3 as a confirmation and ranking tool for plausible qpAdm models.The companion paper recommends using ADMIXTURE.Should both be used where applicable?How does increasing power with f3 help compared to ADMIXTURE and vice versa?
Minor comments ============= lines 21-22: "under conditions resembling historical populations" I assume "aDNA conditions" are meant here given the previous sentence, but those don't really apply to "populations".I suggest rephrasing this sentence to clarify what's intended to be said here.
--line 171-173: A sentence or two describing why the two-phase simulation approach combining DTWF and standard coalescence is necessary might be helpful for readers, particularly if they were interested in doing their own specific simulation follow up analyses.
---Figure 3: I find the text in the legends on the bottom of the figure a little confusing and it took me a while to decipher it just by looking at the figure: I would suggest expanding the legend belonging to panels I-L a bit more, perhaps with an explicit/longer text such as "Prob. of QTP-binary (assuming qpAdm p-value >= 0.05 & admixture proportions [0, 1])".Or, if this would be deemed to long, something slightly more explicit because the current "(QTP -binary) p>=0.05& weights [0:1]" is too terse.I also suggest replacing "N: FP qpAdm models" with more conventional "# N of qpAdm models" in the equation using the convention of # to represent counts (if this is indeed what is intended by "N:").
--line 424: When the authors write "subtle and weakly significant", do they talk about effect size?Otherwise, at a given significance level, something either is or isn't significant; talking about "weak significance" seems vague.I wasn't sure how to interpret this statement.Clarifying it with more explicit and rigorous language would help.
--line 464: The terms FPR and FDR appear to be first mentioned here but are not formally defined above in the context of admixture graphs.On line 481 a FPR definition is stated but it's not clear upon initial reading whether this definition is applicable to FPR discussed in the previous two paragraphs above this line.
Also, for formal reasons, the first instances of FPR and FDR on line 464 should be spelled out in their full non-abbreviated forms regardless of how obvious and generally known they are.
---I was slightly surprised by seeing notation [x:y] indicating that a number lies between x and y throughout the manuscript.Perhaps this is inspired by R vector syntax or something similar but why not use a more common notation for intervals in mathematics: [x, y] for closed intervals and (x, y) for open intervals, and combinations of both for semi-closed intervals?---On a related note, the Fst-bin columns in Figure 4 share right or left Fst boundaries, respectively (0.008 and 0.013) but it's not clear which boundary is included in which panel.Again, using semi-closed interval notation will make this unambiguous.
1st Accepted Version -Authors' Response to Reviewers: April 8, 2024 Reply to Reviewer Testing Times: Disentangling Admixture Histories in Recent and Complex Demographies using ancient DNA Dear anonymous reviewer, We sincerely thank you for your careful reading and review of our companion manuscripts.We believe your comments have substantially improved the translatability of our results for the ancient DNA (aDNA) community, in addition to strengthening a number of our inferences.In addressing your concerns, we have generated three additional supplementary figures and modified two main figures in accordance with your suggestions (see below).
We respond to each of the reviewers' comments separately, splitting each section with a horizontal line to increase readability.To further increase readability, we have enclosed the reviewers' comments with quotation marks ("..."), formatted our response text in bold lettering font, and included any manuscript text that was extensively modified in blue color font with associated line numbers L:X-Y taken from the updated manuscript.
Sincerely, Matthew Williams on behalf of the authors.
"lines 90-91: "We started by simulating two chromosomes of combined length ~491 Mbp under four simplistic and qualitatively different admixture graphs".Why have the authors simulated only two chromosomes of 491 Mb in total when in the second stage, for the complex model, they do simulate truly genome-scale data (so computational cost isn't the issue)?My question is motivated by this: If I simulate replicates of a 1Mb, 10Mb, 100Mb chromosome and compute (for instance) an f4 statistic to test an admixture hypothesis, I will get f4 values which will be less noisy the more sequence is simulated (and more accurate admixture inferences).I would expect the same should apply even for qpAdm, with an increase in power with larger sequence lengths, purely by reducing statistical noise?In order to make the results more comparable to the behavior of real data (and between the two simulation studies in the manuscript), wouldn't it be more accurate to simulate replicates from a single simulation of all 22 autosomes for each topology (about 3000 Mb genomes)?It's not immediately obvious to me whether the f-statistics/qpAdm noisiness expected in real data would be otherwise comparable to these 491 Mb simulations.I find this decision even more surprising because on line 95 the authors write that they simulated truly whole-genome simulations for their complex models.So why not make the simulation setups the same across both simulation studies?" We concur with the reviewer's intuition that 1) enhancements in qpAdm performance are likely to occur as the standard errors of the underlying f-statistics decrease, and 2) as genome sizes increase, the standard errors are expected to decrease.Furthermore, we acknowledge that our current manuscript does not effectively convey how our findings from the simple model, which uses 491Mb genome simulations, can be applied to "real data conditions" or our complex model simulations that incorporate whole-genome with aDNA conditions.
Firstly, while simulation software like msprime and the associated tree-sequence formats have significantly reduced the computational cost of simulations, executing human whole-genome simulations of approximately 3Gb remains quite resource-intensive.Consequently, we lack the resources to conduct 5,000 simulations for four simple models and were limited to 50 replicates of a single complex model.To enhance the clarity of our manuscript, we have added text detailing the computational cost of our whole-genome 1st Accepted Version -Authors' Response to Reviewers: April 8, 2024 simulations on lines 611-612 of the Results section for the complex simulations: Our whole-genome simulations required an average of 12.6±2.2GB of memory and a duration time of 221±10 hours per-replicate.
Additionally, we have added new text to the Discussion section (see below text from lines 805-814) that provides further clarity on the two approaches we implemented.It also highlights their synergistic relationship, which allowed us to evaluate admixture inference performance at two different scales of complexity.
To address the reviewers' concern regarding genome size and standard errors, we performed a subsampling of chromosomes 1 and 2 (approximately 491 Mb) from each of the 50 replicate whole-genome complex model simulations and executed qpAdm again.This re-analysis enabled us to compare the standard errors of qpAdm under the following five data quality conditions: 1. Branch length, whole genome.2. Branch length, chromosomes 1 and 2 (subsampled from the whole genome).3. aDNA, random sampling of ascertained SNPs (subsampled from the whole genome).4. aDNA, sampling of 100K to 500K ascertained SNPs (subsampled from the whole genome). 5. aDNA, sampling of less than 100K ascertained SNPs (subsampled from the whole genome).
In our revised Supplementary Figure 13 (below), we note that the standard errors calculated from the approximately 491 Mb genome are within the range of those derived from whole-genome simulations under both the random aDNA missingness sampling scheme, and when constraining the number of SNPs to be between 100K and 500K.Therefore, the insights gained from our simple simulations are directly applicable to analyses of empirical data.We have incorporated updated text into the Discussion section of the manuscript that explains these results and highlights the relevance of the simple simulations to the complex simulations and real-world data (see below text from lines 805-814).

Lines 805-814:
We addressed these questions through simulations of both simple admixture-graph-like demographies that explore a broad parameter space on two chromosomes, and whole-genome simulations of admixture-graph-like demography that reflects the inferred complexity of Eurasian population history.We note that computing qpAdm on chromosomes 1 and 2 subsetted from the 50 whole-genome simulations exhibit standard errors within the range observed under the aDNA 100k to 500k SNP and random sampling missingness approaches (SI Figure S13), demonstrating the relevance of our inferences from the simple demographic simulations for empirical aDNA analysis.Moreover, this highlights the complementarity of our approach, enabling us to simultaneously investigate a broad parameter space that is computationally infeasible under whole-genome simulations-whilst evaluating the effects of aDNA data missingness and the upper boundaries of qpAdm performance that would otherwise be obscured using smaller genomes.
1st Accepted Version -Authors' Response to Reviewers: April 8, 2024 Supplementary Figure S13 SI Figure S13.Evaluation of the qpAdm standard errors under varying simulated genome sizes and ancient DNA data missingness conditions using the complex simulation.The f2-statistic computed on chromosomes 1 and 2 was subsampled from the whole-genome simulations and qpAdm re-run to generate standard errors.
"Caption of Figure 1B: Why have the authors opted to focus on Fst values specifically in ancient Southwest Asia?Given the interesting results showing the relationship between qpAdm power and population differentiation throughout the manuscript, why not show Fst from the entire AADR panel?In fact, this could be even partitioned into panels showing this style of age-vs-Fst plots across time periods in different geographic regions (Europe, Southwest Asia, etc.?).This would make it much easier for a reader to put results such as those in Figure 3 into a wider context, including Europe where majority of aDNA is still coming from.On a related note, while referring to the same Figure 1B on lines 225-227 the authors talk about "ancient southwest Eurasian groups".Then again, line 241 talks again about "Southwest Asians".This makes me think it should be one or the other in all of those instances." We agree with the reviewer that incorporating more geographical regions will enhance the applicability of the analysis to a broader range of regions.Our focus was on ancient groups from southwest Asia, as this is the region that our complex demographic model is based on.We also acknowledge that while the number of published ancient genomes is growing rapidly, only a few regions have sufficient samples across all periods to conduct a temporal analysis of shifts in FST.Europe is one such region that the reviewer correctly identified.Therefore, we have included samples from southern-central-western Europe and the Mediterranean, in addition to the groups from southwest Asia, to illustrate the decrease in FST.Please refer to the updated Figure 1 below for the modifications.
1st Accepted Version -Authors' Response to Reviewers: April 8, 2024 In response to the author's inquiry into our use of the terms "Southwest Asians" (on original manuscript line 241) and "ancient southwest Eurasian groups" (on original manuscript lines 225-227), we thank the reviewer for their careful reading.We show below modified text from our updated manuscript where we had previously used the term "Southwest Asians" that describes below to provide more clarity for the reader in what we are referring to in each instance.We have now also made explicit reference to the group in question with respect to their geographical location.
Lines 228-230: Due to extensive admixture between ancient groups occupying regions across Eurasia beginning around the 6th 1st Accepted Version -Authors' Response to Reviewers: April 8, 2024 millennium BCE, populations from historical periods exhibit, on average, lower genetic differentiation than their predecessors (Figure 1B).
Lines 245-246: As these values approach 0.01, equivalent to empirical values observed amongst Bronze Age and older groups occupying Europe, the Mediterranean, and southwest Asia, … Lines 254-256: The smallest range, FST between 0 and 0.008, corresponds to the diversity estimated from samples dating between 1.5k to 3.2k years ago (Figure 1B) with the upper range broadly demarcating the Iron Age from the Bronze Age in southwest Asia.
Lines 659-661: Our simulations resulted in expected levels of population divergence given empirical observations with a median pairwise FST of 0.03 between all ancient groups representing Eurasian populations and 0.017 amongst those representing southwest Asian populations (Figure 6C).
"lines 242-243: "convergence on an average of two plausible qpAdm models per simulation iteration".Reading this made me again think about the effect of the length of the genome simulated, and whether simulating sequence spanning all autosomes (and not just 491 Mb) as mentioned in my comment above would change the proportion of significant models.If the proportion of plausible models is expected to change with genome-scale data, I'm not sure how translatable the described results are towards running qpAdm on empirical data which is, of course, genome-wide." We believe the comment regarding the degree to which the ~490Mb simulations are transferable to 'real empirical' ancient DNA is adequately addressed in the first reply.
"lines 556-577: This paragraph describing a complex simulated model would be incomparably easier to read if it were accompanied by a population phylogenetic tree (or admixture graph) directly in the main text.After all, the model represents a second major case study in the manuscript.It would also make it immediately possible to compare this model to models in the first study shown in Figure 3 and see differences in topological complexity." We agree with the reviewer that the inclusion of a topology of the simulated demography would greatly enhance the interpretability of the model.However, we suggest that plotting such a complicated model becomes almost as un-interpretable as glancing at the simulation parameters table (Supplementary File SF1).As such, we have generated a simplified demography topology (Figure 6A) that is restricted to showing populations that are included in plausible qpAdm models, and ancestral populations that they share either from a shared admixture event or a shared ancestor.
1st Accepted Version -Authors' Response to Reviewers: April 8, 2024 (corresponding to ancestral populations in A).An FST matrix (C) for the sampled simulated populations is also shown.(D) Barplots showing probabilities of encountering a lineage found in the "sLev IA1" group in other simulated ancestral populations (only presenting populations with non-negative probabilities).The ancestral populations are those from which we sampled and correspond to the first column in B.
"How does the size of genome blocks used for jackknifing affect the results of a qpAdm analysis?If admixture is recent, ancestry haplotypes (and extend of linkage between admixed SNPs in a target) will be larger, so blocks overlapping those haplotypes will be entirely formed by SNPs linked on those haplotypes.If admixture is ancient, ancestry haplotypes will be shorter, which will also show at the level of blocks (depending on the block sizes).As such, does the block size have any influence on qpAdm inferences, depending on the age of admixture being modelled?If there is such an effect, perhaps showing a simple figure demonstrating p-values and inferred admixture proportions as a function of block size would be useful.If there's not an effect and this is of no concern at all, a reference to the relevant literature would be helpful." We Lines 451-466: Impact of block jackknife size under recent admixture In order to quantify the uncertainty linked to the random sampling of SNPs, qpAdm employs a jackknife resampling method to calculate standard errors.Using simple demographic Model 1, we investigated how altering the block jackknife size affects the performance of qpAdm, specifically in cases of recent admixture by testing eight different block sizes ranging from 0.01 to 100 Mb across three generations since admixture bins (Tadmix ranges = (0, 50), [50, 100), and [100, max)).We find our results are consistent with Harney et al. 2021, who evaluated the impacts of block jackknife sizes on qpAdm performance from a fixed demography.Our analysis shows that qpAdm estimates of admixture proportion remain unbiased, regardless of the block size, for both recent (Tadmix = (0, 50]) and older (Tadmix = [50,100), and [100, max)) generations since admixture (SI Figure S12).Additionally, our findings show that the smallest block sizes result in the lowest standard error estimates (refer to SI Figure S11B).However, we observe a slight increase in the standard error across all block sizes under recent admixture (i.e., < 50 generations), which suggests that the use of standard errors as a qpAdm plausibility constraint might negatively impact performance under recent admixture (discussed further below).Consistent with the findings of Harney et al., the smallest and largest block sizes yield non-uniformly distributed P-values, and we do not observe changes to the P-value distribution across the three generation since admixture bins (SI Figure S11A).
appreciate the reviewer's question as it enables us to leverage our simulations under the Wright-Fisher (WF) model, which more accurately reflects long-range correlations across the genome due to recent admixture events.We conducted further analyses and incorporated the findings in a new section of the manuscript (see below).Our observations align with previous research on the performance of qpAdm under a simple demographic model by Harney et al. 2021.Moreover, we were able to show that the number of generations since admixture does not influence the distribution of p-values or the bias in admixture weights.Nevertheless, we did notice a slight increase in the standard error with a decrease in the number of generations since admixture, all of which we document below.